ABSTRACT



A checkpoint of a parallel program is taken in order to
provide a consistent state of the program in the event the 
program is to be restarted. Each process of the parallel 
program is responsible for taking its own checkpoint;
however, the timing of when the checkpoint is to be taken by
each process is the responsibility of a coordinating process. 
During the checkpointing, various data is written to a 
checkpoint file. This data includes, for instance, in-transit 
message data, a data section, file offsets, signal state, execut- 
able information, stack contents and register contents. The 
checkpoint file can be stored either in local or global storage. 
When it is stored in global storage, migration of the program 
is facilitated. When a parallel program is to be restarted, 
each process of the program initiates its own restart. The 
restart logic restores the process to the state at which the
checkpoint was taken. 

20 Claims, 11 Drawing Sheets 
METHOD OF PERFORMING CHECKPOINT/
RESTART OF A PARALLEL PROGRAM 

CROSS-REFERENCE TO RELATED 
APPLICATIONS 

This application contains subject matter which is related 
to the subject matter of the following applications, each of 
which is assigned to the same assignee as this application 
and filed on the same day as this application. Each of the 
below listed applications is hereby incorporated herein by 
reference in its entirety:

"A SYSTEM OF PERFORMING CHECKPOINT/ 
RESTART OF A PARALLEL PROGRAM," by Meth 
et al., Ser. No. 09/181,981;
"PROGRAM PRODUCTS FOR PERFORMING 
CHECKPOINT/RESTART OF A PARALLEL 
PROGRAM," by Meth et al., Ser. No. 09/182,555; 
"CAPTURING AND IDENTIFYING A COMPLETE 
AND CONSISTENT SET OF CHECKPOINT FILES," 
by Meth et al., Ser. No. 09/182,175;
"RESTORING CHECKPOINTED PROCESSES 
INCLUDING ADJUSTING ENVIRONMENT VARI- 
ABLES OF THE PROCESSES," by Meth et al., Ser. 
No. 09/182,357; and 
"RESTORING CHECKPOINTED PROCESSES WITH- 
OUT RESTORING ATTRIBUTES OF EXTERNAL 
DATA REFERENCED BY THE PROCESSES," by 
Meth et al., issued Jul. 3, 2001 as U.S. Pat. No. 
6,256,751. 

TECHNICAL FIELD

This invention relates, in general, to processing of parallel 
programs and, in particular, to performing checkpoint and 
restart of a parallel program. 

BACKGROUND ART 

Enhancing the performance of computing environments 
continues to be a challenge for system designers, as well as 
for programmers. In order to help meet this challenge, 
parallel processing environments have been created, thereby 
setting the stage for parallel programming. 

A parallel program includes a number of processes that 
are independently executed on one or more processors. The 
processes communicate with one another via, for instance, 
messages. As the number of processors used for a parallel 
program increases, so does the likelihood of a system 
failure. Thus, it is important in a parallel processing
environment to be able to recover efficiently so that system
performance is only minimally impacted. 

To facilitate recovery of a parallel program, especially a 
long running program, intermediate results of the program 
are taken at particular intervals. This is referred to as 
checkpointing the program. Checkpointing enables the pro- 
gram to be restarted from the last checkpoint, rather than 
from the beginning. 

One technique for checkpointing and restarting a program 
is described in U.S. Pat. No. 5,301,309, entitled "Distributed
Processing System With Checkpoint Restart Facilities 
Wherein Checkpoint Data Is Updated Only If All Processors 
Were Able To Collect New Checkpoint Data", issued on Apr. 
5, 1994. With that technique, processes external to the 
program are responsible for checkpointing and restarting the 
program. In particular, failure processing tasks detect that 
there has been a system failure. Restart processing tasks 
execute the checkpoint restart processing in response to the
detection of the system failure, and checkpoint processing 
tasks determine the data necessary for the restart processing. 
Thus, the external processes are intimately involved in
checkpointing and restarting the program.

Although the above-described technique, as well as other
techniques, have been used to checkpoint and restart 
programs, further enhancements are needed. For example, 
checkpoint/restart capabilities are needed in which the 
checkpointing and restarting of a process of a parallel 
program is handled by the process itself, instead of by 
external processes. Further, a need exists for checkpoint/ 
restart capabilities that enable the saving of interprocess 
message state and the restoring of that message state. 
Additionally, a need exists for a checkpoint capability that 
provides for the committing of a checkpoint file, so that only 
one checkpoint file for a process need be saved for restart 
purposes. Yet further, a need exists for a checkpoint
capability that allows the writing of checkpoint files to either
global or local storage. Further, a need exists for checkpoint/ 
restart capabilities that allow migration of the processes 
from one processor to another. 

SUMMARY OF THE INVENTION 

The shortcomings of the prior art are overcome and 
additional advantages are provided through the provision of 
a method of checkpointing parallel programs. The method 
includes, for instance, taking a checkpoint of a parallel 
program, wherein the parallel program includes a plurality 
of processes. The taking of a checkpoint includes writing, by 
a process of the plurality of processes, message data to a 
checkpoint file corresponding to the process. The message 
data includes an indication that there are no messages, or it
includes one or more in-transit messages between the pro- 
cess writing the message data and one or more other 
processes of the plurality of processes. 

In a further embodiment, the taking of a checkpoint 
further includes writing, by a process of the plurality of 
processes, a data section, a signal state and/or one or more 
file offsets to a checkpoint file corresponding to the process 
that is writing the data section, signal state and/or file
offset(s).

In yet a further embodiment, the taking of a checkpoint 
further includes writing, by a process of the plurality of 
processes, executable information, stack contents, and/or 
register contents to a checkpoint file corresponding to the 
process writing the executable information, the stack con- 
tents and/or the register contents. 

In another embodiment of the invention, the method 
includes restoring the process that wrote the message data to 
the checkpoint file, wherein the restoring includes copying 
the message data from the checkpoint file to memory of the 
computing unit executing the process. 

In one example, the computing unit executing the process
is different from the computing unit on which the process
took the checkpoint.

In another embodiment of the invention, the taking of a 
checkpoint further includes taking a checkpoint by a number 
of processes of the plurality of the processes. The taking of 
a checkpoint by the number of processes includes writing
data to a number of checkpoint files, wherein each process 
of the number of processes takes a corresponding check- 
point. 

In a further example, the taking of the corresponding 
checkpoints by the number of processes is coordinated.

In another aspect of the invention, a method of restoring
parallel programs is provided. The method includes, for 
instance, restarting one or more processes of the parallel
program on one or more computing units, wherein at least
one of the processes is restarted on a different computing
unit from the computing unit that was previously used to
take at least one checkpoint for the at least one process.
Further, data stored in one or more checkpoint files
corresponding to the one or more restarted processes is copied
into memory of the one or more computing units executing
the restarted processes.

In yet a further aspect of the invention, a method of
checkpointing parallel programs is provided. The method
includes indicating, by a process of a parallel program, that
the process is ready to take a checkpoint; receiving, by the
process, an indication to take the checkpoint; taking the
checkpoint, which includes having the process copy data
from memory associated with the process to a checkpoint
file corresponding to the process; and indicating, by the
process, completion of taking the checkpoint.

In accordance with the principles of the present invention,
checkpoint/restart capabilities are provided that allow the
processes themselves to take the checkpoint and to restart
after a failure. Additionally, in-transit messages between
processes (interprocess messages) or an indication that there
are no messages is saved during the checkpointing of the
program. The messages are saved without having to log the
messages in a log file. Further, after the processes have taken
their checkpoints, the checkpoint files are committed, so that
there is only one checkpoint file for each process at the time
of restart. Yet further, the capabilities of the present
invention allow the writing of checkpoints to either global or
local storage. Additionally, migration of the processes from
one system to another is allowed, when the checkpoints are
written to global storage.

Additional features and advantages are realized through
the techniques of the present invention. Other embodiments
and aspects of the invention are described in detail herein
and are considered a part of the claimed invention.

BRIEF DESCRIPTION OF THE DRAWINGS

The subject matter which is regarded as the invention is
particularly pointed out and distinctly claimed in the claims
at the conclusion of the specification. The foregoing and
other objects, features, and advantages of the invention will
be apparent from the following detailed description taken in
conjunction with the accompanying drawings in which:

FIGS. 1a and 1b depict examples of computing
environments incorporating and using the checkpoint/restart
capabilities of the present invention;

FIG. 2 depicts one example of message packets and
information associated therewith, in accordance with the
principles of the present invention;

FIG. 3 depicts one example of various components of the
memory depicted in FIG. 1a, in accordance with the
principles of the present invention;

FIG. 4 depicts one embodiment of a memory layout of a
process, in accordance with the principles of the present
invention;

FIG. 5 depicts one example of registers, file descriptors
and signal information associated with a process, in
accordance with the principles of the present invention;

FIG. 6 illustrates one example of message communication
between a coordinating process and user processes, in
accordance with the principles of the present invention;

FIG. 7 depicts one embodiment of the logic associated
with synchronizing checkpointing among various processes,
in accordance with the principles of the present invention;

FIG. 8 depicts one embodiment of the logic associated
with initiating the taking and committing of a checkpoint, in
accordance with the principles of the present invention;

FIG. 9 depicts one embodiment of the logic associated
with taking a checkpoint, in accordance with the principles
of the present invention;

FIG. 10 depicts one embodiment of the logic associated
with restarting a process using checkpoint data previously
taken, in accordance with the principles of the present
invention;

FIG. 11 depicts one embodiment of additional logic used
to handle a restart, in accordance with the principles of the
present invention;

FIG. 12 depicts one embodiment of writing checkpoints
to global storage, in accordance with the principles of the
present invention; and

FIG. 13 depicts one embodiment of writing checkpoints
to local storage, in accordance with the principles of the
present invention.

BEST MODE FOR CARRYING OUT THE
INVENTION

In accordance with the principles of the present invention,
checkpoints are taken within a parallel program so that the
program may be restarted from an intermediate point should
the program need to be restarted. For example, each process
of the parallel program takes checkpoints at particular
intervals. These checkpoints are performed internally by the
processes; however, the timing of when the checkpoints are
to be taken is coordinated by a coordinating process. During
the checkpointing, various data is saved. This data includes,
for instance, interprocess message data, signal state data, file
offset information, register contents, executable
information, a Data Section and/or stack contents. This data
is used in the event the parallel program needs to be
restarted. As with the checkpointing, the restarting of each
process is handled internally by the process itself.

One example of a computing environment incorporating
and using the checkpoint/restart capabilities of the present
invention is depicted in FIG. 1a. Computing environment
100 includes, for instance, a computing unit 101 having at
least one central processing unit 102, a main memory 104
and one or more input/output devices 106, each of which is
described below.

As is known, central processing unit 102 is the controlling
center of computing unit 101 and provides the sequencing
and processing facilities for instruction execution,
interruption action, timing functions, initial program loading and
other machine related functions.

The central processing unit executes at least one operating
system, which as known, is used to control the operation of
the computing unit by controlling the execution of other
programs, controlling communication with peripheral
devices and controlling use of the computer resources.

Central processing unit 102 is coupled to main memory
104, which is directly addressable and provides for high
speed processing of data by the central processing unit. Main
memory 104 may be either physically integrated with the
CPU or constructed in stand-alone units.

Main memory 104 and central processing unit 102 are
also coupled to one or more input/output devices 106. These
devices include, for instance, keyboards, communications
controllers, teleprocessing devices, printers, magnetic
storage media (e.g., tape, disks), direct access storage devices,
sensor based equipment, and other storage media. Data is
transferred from main memory 104 to input/output devices
106, and from the input/output devices back to main
memory.

In one example, computing environment 100 is a single
system environment, which includes an RS/6000 computer
system running an AIX operating system. (RS/6000 and AIX
are offered by International Business Machines
Corporation). In another example, computing environment
100 includes a UNIX workstation running a UNIX-based
operating system. Other variations are also possible and are
considered a part of the claimed invention.

Another embodiment of a computing environment
incorporating and using the checkpoint/restart capabilities of the
present invention is depicted in FIG. 1b. In one example, a
computing environment 107 includes a plurality of
computing units 108 coupled to one another via a connection 110.
In one example, each unit is an RS/6000 computing node
running AIX, and the units are coupled together via a token
ring or a local area network (LAN). Each unit includes, for
example, a central processing unit, memory and one or more
input/output devices.

In another embodiment, each unit is a UNIX workstation
running a UNIX-based operating system, and the units are
coupled to one another via a TCP/IP connection.

In yet a further embodiment, the environment includes a
large parallel system with a plurality of units (e.g., 512
nodes) coupled to one another via a network connection,
such as a switch. The invention is not limited to a particular
number of units coupled together.

The above embodiments are only examples, however. The
capabilities of the present invention can be incorporated and
used with any type of computing environments or
computing units (e.g., nodes, computers, processors, systems,
machines, and/or workstations), without departing from the
spirit of the present invention.

A computing unit of the present invention is capable of
executing both serial and parallel programs. However, it is
in the context of the parallel programs that the checkpoint/
restart capabilities of the present invention are described
(although various aspects of the present invention are also
applicable to serial programs). A parallel program includes
one or more processes (or tasks) that are executed
independently. In one example, the processes of a parallel program
are coordinated by a coordinating process. The processes of
a parallel program communicate with each other and the
coordinating process by, for instance, passing messages back
and forth. In one example, a Message Passing Interface
(MPI), offered by International Business Machines
Corporation, is used to communicate between the various
processes. MPI is described in "IBM Parallel Environment
for AIX: MPI Programming and Subroutine Reference,"
IBM Publication No. GC23-3894-02 (August 1997), which
is hereby incorporated herein by reference in its entirety.

In one example, the messages to be sent are segmented
into packets 200 (FIG. 2) of some convenient maximum
size, and copied into an array of packet slots 201. A next
sequence number 202 is assigned to the packet, and an
acknowledge (ack) status 204 corresponding to the packet is
set to unacknowledged. The sequence number and ack status
are stored in a parallel array 206 to the message packet slots.
The message packet is then sent across the communication
subsystem (in-transit). The sequence number is sent along
with the message. At some subsequent time, the receiver of
the message sends an acknowledgment message recording
the highest consecutive sequence number received. This is
used by the sender of the message to set the ack status to
"received" for all packets up to and including that sequence
number. Once the packet has been acknowledged, the slot
can be freed for reuse. Periodically, unacknowledged
packets are retransmitted. This is a standard technique for
providing reliable transmission of messages over an unreliable
medium.

Each process of a parallel program is loaded in the
memory of the computing unit that is to execute the process.
This is depicted in FIG. 3. As one example, memory 104
includes one or more application processes 300. Each
process may make library calls to various program libraries 302,
also loaded within the memory. One program library that is
called, in accordance with the principles of the present
invention, is a checkpoint/restart library 304. Checkpoint/
restart library 304 is called by each process that wishes to
use the checkpoint/restart capabilities of the present
invention. In addition to the above, memory 104 includes a system
kernel 306, which provides various system services to the
application processes and the libraries.

Memory 104 is further described with reference to FIG. 4,
which depicts one embodiment of the memory layout for an
application process. In particular, for each process, memory
104 includes programming code 400, global variables 402
used by the process, a heap 404 (for dynamic memory
allocation while the program is running), and a stack 406.
The global variables and the heap are referred to as the "Data
Section" of the process, which is distinct from the stack of
the process. Each process running in the computing unit has,
in addition to its code, a separate portion of memory to store
its Data Section and stack. This section is referred to as a
user address space 408. In addition to the user address space,
the memory includes a kernel address space for the
system kernel.

Each process running in the computing unit also has a
separate copy of registers 500 (FIG. 5), which includes a
stack pointer 502 and a program counter 504. Further, each
process has associated therewith various file descriptor and
signal information 506.

In accordance with the principles of the present invention,
each individual process of the parallel program (that is
participating in the checkpoint/restart capabilities of the
present invention) is responsible for taking its own
checkpoint and for restarting itself in the event of a failure.
However, the timing of when the individual checkpoints are
to be taken by the user processes is the responsibility of a
coordinating or master process. Communication between the
user processes and the coordinating process is illustrated in
FIG. 6.

A coordinating process 600 receives messages initiated by
user processes 602. In one example, coordinating process
600 is the Parallel Operating Environment (POE) offered by
International Business Machines Corporation. The user
processes send the messages to the POE via, for instance, a
partition manager daemon (PMD) 604, which is also offered
by International Business Machines Corporation as part of
the POE. The PMDs are also used by the coordinating process
to send messages to the user processes. POE and PMD
are described in detail in "IBM Parallel Environment For
AIX: Operation and Use," Vols. 1&2, IBM Publication Nos.
SC28-1979-01 (August 1997) and SC28-1980-01 (August
1997), which are hereby incorporated herein by reference in
their entirety.

In particular, each user process sends a Checkpoint Ready
message (message 1), when it is ready to take a checkpoint,
and a Checkpoint Done message (message 3), when it has
completed taking the checkpoint. Likewise, the coordinating
process sends a Checkpoint Do message (message 2), when
it is time for the processes to take a checkpoint, and a 
Checkpoint Commit message (message 4), when the user 
processes are to commit to the new checkpoint. The use of
these messages is further described below with reference to
FIGS. 7 and 8. FIG. 7 describes processing from the point 
of view of a coordinating process and FIG. 8 describes 
processing from the point of view of a user process. 

Referring to FIG. 7, one embodiment of the logic associated
with the checkpoint synchronization provided by the
coordinating process is described in detail. Initially, two 
variables to be used during the synchronization process, 
referred to as READY and DONE, are initialized to zero, 
STEP 700. 

When a message is received by the coordinating process, 
STEP 702, a determination is made as to whether the 
message is a Checkpoint Ready message sent by a user 
process, INQUIRY 704. If the message is not a Checkpoint
Ready message, then the coordinating process proceeds with
other processing, STEP 706, and flow returns to STEP 702
"RECEIVE MESSAGE". 

On the other hand, if the message is a Checkpoint Ready 
message sent by a user process that is ready to take a 
checkpoint, then the processing variable referred to as 
READY is incremented by one, STEP 708. Subsequently, a 
determination is made as to whether all of the participating 
processes of the parallel program are ready to take a check- 
point. In particular, a check is made to see whether READY 
is equal to the number of processes initiated (or indicated as 
participating) for the parallel program, INQUIRY 710. If 
there are still processes that have not sent the Checkpoint 
Ready message to the coordinating process, then the pro- 
cesses do not take the checkpoint yet, so flow returns to 
STEP 702 "RECEIVE MESSAGE". 

However, if the coordinating process has received a 
Checkpoint Ready message from each of the processes of 
the parallel program, then the coordinating process broad- 
casts a Checkpoint Do message to each of the processes, 
STEP 712. This indicates to the processes that each process 
can now take its checkpoint. 

After each process takes its checkpoint, it sends a Check- 
point Done message to the coordinating process indicating 
that it is done taking the checkpoint. When the coordinating 
process receives a message, STEP 714, it determines 
whether the message is the Checkpoint Done message, 
INQUIRY 716. If it is not a Checkpoint Done message, then 
the coordinating process continues with other processing, 
STEP 718, and flow returns to STEP 714 "RECEIVE 
MESSAGE". 

On the other hand, if the coordinating process has 
received a Checkpoint Done message from a user process, 
then the variable referred to as DONE is incremented by 
one, STEP 720. Thereafter, a determination is made as to 
whether all of the participating processes of the parallel 
program have completed taking their checkpoints. In 
particular, DONE is compared to the number of processes 
initiated (or indicated as participating) for the parallel 
program, INQUIRY 722. If DONE is not equal to the 
number of processes, then flow returns to STEP 714 
"RECEIVE MESSAGE". 

However, if each of the processes has sent a Checkpoint 
Done message to the coordinating process, then the coordi- 
nating process broadcasts a Checkpoint Commit message to 
all of the processes, STEP 724. At this time, each of the
processes can commit to the checkpoint just taken. This 
completes the checkpoint synchronization performed by the 
coordinating process. 
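
The synchronization of FIG. 7 can be sketched as a small two-phase counter. This is only an illustration under stated assumptions: the `Msg` enumeration, the `Coordinator` class and the `broadcast` callback are hypothetical names that do not come from the POE, and resetting the counters after the commit (so that a later checkpoint round can start) is an assumption the text does not spell out.

```python
from enum import Enum, auto


class Msg(Enum):
    CHECKPOINT_READY = auto()   # message 1, user process -> coordinator
    CHECKPOINT_DO = auto()      # message 2, coordinator -> user processes
    CHECKPOINT_DONE = auto()    # message 3, user process -> coordinator
    CHECKPOINT_COMMIT = auto()  # message 4, coordinator -> user processes
    OTHER = auto()              # any message unrelated to checkpointing


class Coordinator:
    """Counts READY and DONE messages and broadcasts DO and COMMIT."""

    def __init__(self, num_processes, broadcast):
        self.num_processes = num_processes
        self.broadcast = broadcast  # hypothetical callback: broadcast(Msg)
        self.ready = 0              # STEP 700: both counters start at zero
        self.done = 0
        self.do_sent = False

    def on_message(self, msg):
        if not self.do_sent:
            # First phase (STEPs 702-712): wait for every process to be ready.
            if msg is Msg.CHECKPOINT_READY:
                self.ready += 1
                if self.ready == self.num_processes:
                    self.broadcast(Msg.CHECKPOINT_DO)
                    self.do_sent = True
        elif msg is Msg.CHECKPOINT_DONE:
            # Second phase (STEPs 714-724): wait for every process to finish.
            self.done += 1
            if self.done == self.num_processes:
                self.broadcast(Msg.CHECKPOINT_COMMIT)
                # Assumption: reset so that a later round can begin.
                self.ready = self.done = 0
                self.do_sent = False
        # Any other message falls through to "other processing".
```

Keeping the two counters separate mirrors the READY and DONE variables of the flow: no process is told to take its checkpoint until all are ready, and none is told to commit until all are done.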
Although it is the responsibility of the coordinating process
to inform each of the user processes of when to take a
checkpoint, it is the user process itself that takes the
checkpoint. In one example, in order to take a checkpoint, a user
process calls mp_chkpt( ) from the checkpoint/restart
library. The logic associated with this library function is 
described below with reference to FIG. 8. 

When a user process is ready to take a checkpoint, it sends 
a Checkpoint Ready message to the coordinating process via
the PMD, STEP 800. After it sends the message, it waits
until it receives the Checkpoint Do message from the
coordinating process. 

Upon receiving the Checkpoint Do message from the 
coordinating process, STEP 802, the user process takes a
checkpoint, STEP 804. The checkpoint processing, which is
described in
further detail with reference to FIG. 9, copies the state of the 
process out to a checkpoint file on external storage media 
(e.g., disk). Subsequent to taking the checkpoint, the user 
process sends the Checkpoint Done message to the coordinating
process, STEP 806. At this point, the user process
stops processing until it receives the Checkpoint Commit 
message from the coordinating process. 

When the coordinating process receives a Checkpoint 
Done message from each of the participating user processes 
(e.g., the processes of the parallel program), it broadcasts the 
Checkpoint Commit message to each of the participating 
user processes. When the user process receives the commit 
message, STEP 808, it deletes any older version of the 
checkpoint file, STEP 810. Thus, there is only one check- 
point file for the process, which corresponds to the check- 
point that was just taken. This completes the processing 
initiated by a user process to take and commit a checkpoint.
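
The commit step can be sketched as follows. The versioned file naming scheme (`chkpt.<rank>.<version>`) and both function names are hypothetical and chosen only for illustration; the text states only that older versions of the checkpoint file are deleted on commit.

```python
import os


def take_checkpoint(directory, rank, version, state):
    """Write the process state to a new, versioned checkpoint file (STEP 804)."""
    path = os.path.join(directory, f"chkpt.{rank}.{version}")
    with open(path, "wb") as f:
        f.write(state)
    return path


def commit_checkpoint(directory, rank, version):
    """Delete every older checkpoint file of this process (STEP 810)."""
    keep = f"chkpt.{rank}.{version}"
    prefix = f"chkpt.{rank}."  # trailing dot keeps rank 1 apart from rank 10
    for name in os.listdir(directory):
        if name.startswith(prefix) and name != keep:
            os.remove(os.path.join(directory, name))
```

Because the old file is removed only after the Checkpoint Commit message arrives, a failure during the new checkpoint still leaves the previous complete checkpoint available for restart.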

Further details of taking a checkpoint, in accordance with
the principles of the present invention, are described
with reference to FIG. 9. In one embodiment, in order to take 
a checkpoint, the user process stops message traffic (e.g., 
message passing interface (MPI) messages), STEP 900. In 
particular, it stops sending messages to other processes, 
including the coordinating process and other user processes, 
and it stops receiving messages from other processes. 

In addition to stopping the messages, the user process also 
blocks signals, STEP 902. This prevents an inconsistent state 
of the data. For example, if delivery of a signal were allowed,
a signal handler installed by the parallel program could
interrupt the checkpointing operation and change the program
state, which could result in an inconsistent checkpoint. Thus,
all signals are blocked
during the taking of the checkpoint. 
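
On a POSIX system, blocking signals around a critical section can be sketched with Python's standard `signal` module. The helper name `signals_blocked` is hypothetical; this is an illustration of the idea, not the library's actual implementation, and it assumes a platform where `signal.pthread_sigmask` is available.

```python
import signal
from contextlib import contextmanager


@contextmanager
def signals_blocked():
    """Block all blockable signals for the duration of the with-block."""
    # SIGKILL and SIGSTOP cannot be blocked; the system ignores them here.
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, signal.valid_signals())
    try:
        yield
    finally:
        # Restore the previous mask; signals that arrived meanwhile are
        # delivered once they are unblocked.
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
```

Signals that arrive while the mask is in place stay pending rather than being lost, which matches the pending signals that the signal state is said to record.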

After stopping the messages and blocking the signals, the
file offsets of any open files of the process are saved to the 
Data Section (in memory) of the process, STEP 904. 
Additionally, the signal state is also saved to the Data 
Section, STEP 906. The signal state includes any pending
signals received from other processes that have not been
processed or any signal masks that define what to do if 
certain signals are received. The signal state information 
also includes information about the process' signal handlers. 
The saving of the file offsets and the signal state to the Data
Section ensures that the offsets and signal state can be found
during a restore, since the entire Data Section is saved to the
checkpoint file, as described below. (In another
embodiment, the signal state and file offset information are
stored directly in the checkpoint file (after it is opened),
without first being copied to the Data Section.)
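
Recording file offsets in ordinary program data, so that they are captured with the rest of the Data Section, can be sketched as below. Both function names are hypothetical, and reopening each file by path at restore time is an assumption made for illustration.

```python
def save_file_offsets(open_files):
    """Record the current offset of each open file, keyed by its path."""
    return {f.name: f.tell() for f in open_files}


def restore_file_offsets(offsets):
    """Reopen each recorded file and seek back to the saved offset."""
    restored = []
    for path, offset in offsets.items():
        f = open(path, "rb")
        f.seek(offset)
        restored.append(f)
    return restored
```

Since the dictionary returned by `save_file_offsets` lives in the process's own memory, writing out that memory also writes out the offsets, with no separate bookkeeping in the checkpoint file.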

Thereafter, a checkpoint file corresponding to the process 
is opened, STEP 908. The checkpoint file is located via, for 
instance, environment variables that are used to define the
environment and are stored in memory. It is the environment
variables that provide the name of the checkpoint file and the 
directory in which the checkpoint file is stored. Once the 
checkpoint file is opened, a checkpoint file header is written, 
which indicates that this is the start of the checkpoint file, 
STEP 910. 

Subsequently, additional information needed to restore 
the process in the event of a failure is written to the 
checkpoint file. For instance, executable information is 
written to the checkpoint file, STEP 912. This information 
identifies the program to be restarted. (Note that programming
code is not written.) Further, the Data Section, which
includes the file offsets and signal state, is also written to the 
checkpoint file, STEP 914. 

Additionally, any in-transit message data is also written to 
the checkpoint file, in accordance with the principles of the 
present invention, STEP 916. (In another embodiment, the 
in-transit message data may be contained within the Data 
Section.) In-transit message data includes, for instance,
either an indication that there are no messages, or those
messages received by the process that have yet to be
processed and those messages sent by the process to other
processes for which the process has not yet received
acknowledgments. The in-transit message data is written to the
checkpoint file without first writing it to a log file. (Other 
data written to the checkpoint file also need not be logged, 
prior to writing it to the checkpoint file.) 

The checkpointing of the message data saves, for 
instance, the array of packet slots 201 (FIG. 2), as well as
sequence number 202 and ack status 204. Since any unac- 
knowledged packet is eventually retransmitted, the in-transit 
messages can be recreated during restart. The sequence 
number assures that the receiver can screen for duplicates, 
since it is possible that the receiver has received the packet 
but has not acknowledged it yet. 
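
The retransmission and duplicate screening described above can be sketched with a toy sender and receiver. The class names are hypothetical, slots are modeled as a dictionary rather than a fixed array, and the receiver is simplified to drop out-of-order packets; none of these details come from the MPI library itself.

```python
from dataclasses import dataclass, field


@dataclass
class Sender:
    next_seq: int = 0
    slots: dict = field(default_factory=dict)  # seq -> payload, unacked only

    def send(self, payload):
        seq = self.next_seq
        self.next_seq += 1
        self.slots[seq] = payload  # held until acknowledged
        return seq, payload

    def on_ack(self, highest_consecutive):
        # Everything up to and including this number was received,
        # so those slots can be freed for reuse.
        for seq in [s for s in self.slots if s <= highest_consecutive]:
            del self.slots[seq]

    def unacknowledged(self):
        """Packets to retransmit, e.g. after restoring from a checkpoint."""
        return sorted(self.slots.items())


@dataclass
class Receiver:
    expected: int = 0
    delivered: list = field(default_factory=list)

    def on_packet(self, seq, payload):
        if seq == self.expected:
            self.delivered.append(payload)
            self.expected += 1
        # seq < expected is a duplicate and is screened out here.
        return self.expected - 1  # highest consecutive sequence number
```

If a checkpoint is taken after the receiver got a packet but before its acknowledgment reached the sender, the restored sender retransmits that packet and the receiver discards it by sequence number, so the message is delivered exactly once.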

In addition to the above, any stack and register contents 
are also written to the checkpoint file, STEP 918. The stack
includes data that is local to the subroutines that are being 
processed (one subroutine having made a call to another 
subroutine, etc., so that the local variables of all of the active 
subroutines are on the stack), as well as information needed 
to return from a called subroutine when that subroutine has 
completed. 

After the information needed to restore the process back 
to the state it was in when the checkpoint was taken is 
written to the checkpoint file, a determination is made as to 
whether this is the checkpoint process or a restart process, 
INQUIRY 920. In particular, the value of a return code is 
checked to see if it is equal to one, INQUIRY 920. Since this 
is the checkpoint process, the return code is equal to zero and 
processing continues with STEP 922. 

At STEP 922, a checkpoint file footer is written to the 
checkpoint file, and then the checkpoint file is closed, STEP 
924. After the checkpoint file is closed, the signals are 
unblocked and MPI messaging is resumed, STEP 928. This 
completes the checkpoint process. 
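
The return code test of INQUIRY 920 behaves like a setjmp call that returns twice. The following sketch simulates both paths inside a single running process; in an actual restart, the jump is made by a newly started process after its stack has been restored from the checkpoint file. All names are hypothetical.

```c
#include <assert.h>
#include <setjmp.h>

static jmp_buf save_point;  /* stands in for the saved register contents */
static int restarts_seen;   /* counts how often the restart path ran */

/* Simulated restart: jump back to the save point with return code 1. */
static void simulate_restart(void)
{
    longjmp(save_point, 1);
}

/* Returns 0 on the checkpoint path and 1 on the restart path. */
int checkpoint_or_restart(int trigger_restart)
{
    assert(trigger_restart == 0 || trigger_restart == 1);
    if (setjmp(save_point) == 1) {  /* second return: restart path */
        restarts_seen++;
        return 1;
    }
    if (trigger_restart)
        simulate_restart();         /* does not return here */
    return 0;                       /* first return: checkpoint path */
}
```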

In one embodiment, each participating process of the 
parallel program performs the checkpoint process. This 
ensures data integrity. One example of pseudocode used to 
perform the checkpoint process is depicted below. As can be 
seen, various Message Passing Interface (MPI) calls are 
made. The use of MPI is only one example, however. 
Communication between the user processes and between the 
user processes and the coordinating process can be by any 
means. 

mp_chkpt( ) 
    sendPOEMessage(SSM_CHKPT_READY); 
    recvPOEMessage(SSM_CHKPT_DO); 
    mp_stopMPI( ); 
    blockSignals( ); 
    saveFileOffsets( ); 
    saveSignalInfo( ); 
    chkptFd=openChkptFile(chkptFile); 
    writeChkptFileHeader(chkptFd); 
    writeExecInfo(chkptFd); 
    saveDataSegment(chkptFd); 
    mp_saveMPIData(chkptFd); 
    rc=saveStack(chkptFd); 
    /* rc=0 during checkpoint; rc=1 during restart */ 
    if(rc==1){ 
        handleRestart( ); 
        return(1); 
    } 
    writeChkptFileFooter(chkptFd); 
    closeChkptFile(chkptFd); 
    unblockSignals( ); 
    mp_resumeMPI( ); 
    sendPOEMessage(SSM_CHKPT_DONE); 
    recvPOEMessage(SSM_CHKPT_COMMIT); 
    deleteOldChkptFiles(chkptFile); 

Should a parallel program need to be restarted, each 
participating user process of the parallel program initiates a 
restart process. In one example, the restart process is initiated 
by calling mp_restart located within the checkpoint/ 
restart library. One embodiment of the logic associated with 
restart is described with reference to FIG. 10. 

Initially, environment variables are checked to determine 
whether the parallel program is being initiated or reinitiated, 
STEP 1000. In particular, a restart state stored as part of the 
environment variables is used to make this determination. If 
the process is not being restarted, INQUIRY 1002, then 
restart processing is complete, STEP 1004. 
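
The determination of STEP 1000 can be illustrated, for example, as a test of a single environment variable. The variable name CHKPT_RESTART_STATE and the value "1" are assumptions made for this sketch; the actual name and encoding of the restart state are not specified herein.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical variable name; the text does not name the variable. */
#define RESTART_STATE_VAR "CHKPT_RESTART_STATE"

/* Interpret a restart-state string: only "1" means "restart". */
bool restart_requested(const char *state)
{
    return state != NULL && strcmp(state, "1") == 0;
}

/* Returns true when the environment marks this launch as a restart. */
bool is_restart(void)
{
    return restart_requested(getenv(RESTART_STATE_VAR));
}
```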

However, if this is a restart, then information needed from 
the current environment is saved, STEP 1006. This infor- 
mation includes environment information that may have 
changed between the time the checkpoint was taken and the 
restart. For instance, the environment information may 
include the list of computing units running the processes, if 
one or more of the processes has been migrated to one or 
more new computing units. Additionally, other environment 
variables relating to the parallel process are also saved. These 
include, for example, the Internet addresses of the other 
processes in the parallel program, whether to write 
output to a file or a computer terminal screen, information 
about the message passing network, and various other variables 
that are valid only now, at restart time, and for which the 
data in the checkpoint file is stale. 

After saving any needed variables from the current 
environment, the checkpoint file is opened, STEP 1008. 
Subsequently, a determination is made as to whether this is 
the same parallel program that was running, INQUIRY 
1010. In one embodiment, this determination is made by 
comparing various executable information (e.g., checksum and 
other information) relating to the original application stored 
in the checkpoint file with corresponding information of the 
program being restarted. 
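
The comparison of INQUIRY 1010 can be sketched as follows. The ExecInfo fields and the simple multiplicative hash are stand-ins chosen for illustration; the particular checksum algorithm and the "other information" that is compared are not specified herein.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical executable descriptor saved in the checkpoint file. */
typedef struct {
    uint32_t checksum;  /* checksum of the executable image */
    uint64_t size;      /* size of the executable in bytes */
} ExecInfo;

/* Simple multiplicative hash, used here only as a stand-in checksum. */
uint32_t checksum_bytes(const unsigned char *p, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum = sum * 31u + p[i];
    return sum;
}

/* Returns true when the saved and current descriptors match, i.e.,
 * the program being restarted is taken to be the original program. */
bool same_program(const ExecInfo *saved, const ExecInfo *current)
{
    assert(saved != NULL && current != NULL);
    return saved->checksum == current->checksum &&
           saved->size == current->size;
}
```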

If this is not the same program, then the restart process is 
complete. However, if it is the same program, then the 
restoring of the process from the checkpoint file begins. 
During the restoration, the process is restored to the state it 
was in when the checkpoint was taken. 

In particular, the Data Section is restored by copying the 
Data Section from the checkpoint file into memory of the 
computing unit executing the process, STEP 1012. Then, 
from the Data Section, any file offsets and signal state are 
restored, STEPS 1014 and 1016. Additionally, any message 
data is restored by copying the data from the checkpoint file 
to memory, STEP 1018. 

Further, the register contents and the stack contents are 
restored, STEP 1020. When performing user-level 
checkpoint/restart, there is difficulty in restoring the stack. 
This is because the process is running on its stack, while it 
is performing its restart operation. Since the process is using 
the stack during restart, it cannot safely overwrite its stack 
with the saved checkpoint stack. Thus, a temporary stack is 
used. The temporary stack is allocated (up-front) in the Data 
Section of the process. The process switches to the temporary 
stack by using, for instance, a setjmp/longjmp mechanism 
and by manipulating the stack pointer entry provided 
by the setjmp call. While running on the temporary stack, the 
original stack is restored. In particular, the stack contents are 
copied from the corresponding checkpoint file to memory. 
The above is also further described in detail in "Checkpoint 
and Migration of UNIX Processes in the Condor Distributed 
Processing System," by Todd Tannenbaum and Michael 
Litzkow, Dr. Dobb's Journal, 227:40-48, February 1995; 
and in "Checkpoint and Migration of UNIX Processes in the 
Condor Distributed Processing System", by Michael 
Litzkow, Todd Tannenbaum, Jim Basney, and Miron Livny, 
University of Wisconsin-Madison Computer Science Technical 
Report #1346, April 1997, each of which is hereby 
incorporated herein by reference in its entirety. 

After restoring all of the needed information from the 
checkpoint file, the checkpoint file is closed, STEP 1022, 
and the return code is set to one, STEP 1024. Thereafter, a 
long jump to the stack location that the process was at when 
the stack contents were written is performed, STEP 1026. In 
particular, this places the flow at INQUIRY 920 "RETURN 
CODE =1?" in FIG. 9. 

Since the return code is equal to one, the restart processing 
continues. The environment variables saved in restart at 
STEP 1006 are restored, by replacing the old values that are 
stored in the Data Section that was restored at STEP 1012 
with the current values, STEP 1100 (FIG. 11). This provides 
the current operating environment that the process is to run 
in. 

Subsequently, any signals that were blocked during 
checkpointing are unblocked, STEP 1102. That is, when the 
signal state was restored in STEP 1016, certain signals were 
blocked, since that was the state of the processing when the 
checkpoint was taken. Thus, those signals are unblocked 
before the process continues to run. 

In addition to the above, MPI traffic is restarted in order 
to allow message traffic to flow once again, STEP 1104. 
Thereafter, conventional synchronization with the coordinating 
process is performed, STEP 1106. This completes the 
restart processing. 

In one embodiment, the same number of processes are to 
be restarted as was processing before the checkpoint was 
taken. The number of processes is available from the environment 
variables. Thus, each of the processes runs the 
restart processing. 

One example of how to perform the restart operation 
using pseudocode is depicted below: 

mp_restart( ) 
    saveSomeEnvironmentVariables( ); 
    chkptFd=openChkptFile( ); 
    restoreDataSegment(chkptFd); 
    restoreFileOffsets( ); 
    restoreSignalInfo( ); 
    mp_restoreMPIData(chkptFd); 
    restoreStack(chkptFd); 
    /* This longjmps to saveStack with rc=1, and hence 
       invokes handleRestart. It also closes the checkpoint file. */ 

handleRestart( ) 
    restoreSomeEnvironmentVariables( ); 
    unblockSignals( ); 
    mp_restartMPI( ); 
    performPOESync( ); 

In accordance with the principles of the present invention, 
the checkpoint files generated by the processes of a parallel 
program are stored on either global or local storage. One 
example of global storage is shown in FIG. 12. 

A global storage 1200 is storage that is not stored on the 
computing units (i.e., it is external to the computing units). 
It includes, for instance, a global file system, such as the 
Global Parallel File System (GPFS), offered by International 
Business Machines Corporation. The global storage is 
coupled to a plurality of computing units or nodes 1202 of 
a node cluster. A process 1204 resides on a computing unit 
and takes a checkpoint by writing to a named checkpoint file 
for that process. The checkpoint file is stored on the global 
storage. Thus, the global storage is accessible by all the 
processes of the node cluster. 

The storing of the checkpoint files on global storage has 
the advantage of enabling the processes to be restarted on 
different computing units than they were previously executing 
on. That is, the processes are able to migrate to different 
computing units and still be restarted using their checkpoint 
files. (In one embodiment, the type of hardware and level of 
operating system of the migrated machine are similar to the 
machine used to take the checkpoint.) Migration is shown in 
FIG. 12. During the restart process, the same number of 
computing units, or fewer or more computing units are used 
to restart the processes of the parallel program. 

Local storage may also be used to store the checkpoint 
files. As shown in FIG. 13, when local storage is used, a 
process 1300 writes to a checkpoint file, corresponding to 
that process, and the checkpoint file is stored on storage 
1302 local to a computing unit 1304. However, if local 
storage is used, then the process needs to be restarted on the 
same computing unit that it was originally initiated on, 
unless the checkpoint file is also moved (or the local storage 
is shared with another computing unit via a hardware 
connection such as a Serial Storage Adapter (SSA) loop 
offered by International Business Machines Corporation). 

Described in detail above are checkpoint/restart capabilities 
employed by parallel programs. The techniques of the 
present invention advantageously enable each process of a 
parallel program to internally perform checkpointing and 
restarting of the process. Further, it provides for the checkpointing 
of in-transit message data. The capabilities of the 
present invention enable the checkpointing of in-transit 
message data (and other data) to occur without requiring that 
the data first be logged to a log file. Additionally, the 
capabilities of the present invention provide for the committing 
of the checkpoint files, when checkpointing is done, 
so that there is only one checkpoint file for each process at 
restart time. The checkpoint files can be written to local or 
global storage, and when they are written to global storage, 
migration from one computing unit to another is facilitated. 

The present invention can be included in an article of 
manufacture (e.g., one or more computer program products) 
having, for instance, computer usable media. The media has 
embodied therein, for instance, computer readable program 
code means for providing and facilitating the capabilities of 
the present invention. The article of manufacture can be 
included as a part of a computer system or sold separately. 

Additionally, at least one program storage device readable 
by a machine, tangibly embodying at least one program of 
instructions executable by the machine to perform the capa- 
bilities of the present invention can be provided. 

The flow diagrams depicted herein are just exemplary. 
There may be many variations to these diagrams or the steps 
(or operations) described therein without departing from the 
spirit of the invention. For instance, the steps may be 
performed in a differing order, or steps may be added, 
deleted or modified. All of these variations are considered a 
part of the claimed invention. 

Although preferred embodiments have been depicted and 
described in detail herein, it will be apparent to those skilled 
in the relevant art that various modifications, additions, 
substitutions and the like can be made without departing 
from the spirit of the invention and these are therefore 
considered to be within the scope of the invention as defined 
in the following claims. 

What is claimed is: 

1. A method of checkpointing parallel programs, said 
method comprising: 

taking a checkpoint of a parallel program, said parallel 
program comprising a plurality of processes, and 
wherein said taking a checkpoint comprises: 
writing, by a process of said plurality of processes, 
message data to a checkpoint file corresponding to 
said process, said message data including an indica- 
tion that there are no messages, or including one or 
more in-transit messages between said process writ- 
ing the message data and one or more other pro- 
cesses of said plurality of processes. 

2. The method of claim 1, wherein said taking a checkpoint 
further includes writing, by a process of said plurality 
of processes, at least one of a data section, signal state and 
one or more file offsets to a checkpoint file corresponding to 
said process writing said at least one of said data section, 
said signal state and said one or more file offsets. 

3. The method of claim 1, wherein said taking a checkpoint 
further includes writing, by a process of said plurality 
of processes, at least one of executable information, stack 
contents and register contents to a checkpoint file corresponding 
to said process writing said at least one of said 
executable information, said stack contents and said register 
contents. 

4. The method of claim 1, wherein said writing of said 
message data to said checkpoint file is performed without 
logging said message data to a log file. 

5. The method of claim 1, wherein said checkpoint file is 
stored in local storage accessible by said process. 

6. The method of claim 1, wherein said checkpoint file is 
stored in global storage accessible by said plurality of 
processes of said parallel program. 

7. The method of claim 1, further comprising restoring 
said process that wrote said message data to said checkpoint 
file, wherein said restoring comprises copying said message 
data from said checkpoint file to memory of a computing 
unit executing said process. 

8. The method of claim 7, wherein said computing unit 
executing said process is a different computing unit from 
when said checkpoint was taken by said process. 



9. The method of claim 1, wherein said taking a checkpoint 
further comprises taking a checkpoint by a number of 
processes of said plurality of processes, wherein said taking 
a checkpoint by said number of processes comprises writing 
data to a number of checkpoint files, wherein each process 
of said number of processes takes a corresponding checkpoint. 

10. The method of claim 9, further comprising coordinating 
the taking of said corresponding checkpoints by said 
number of processes. 

11. The method of claim 10, wherein said coordinating 
comprises: 

sending a ready message from each process of said 
number of processes to a coordinating task indicating 
readiness to take said corresponding checkpoint; and 

providing, by said coordinating task to said each process, 
a message indicating that said corresponding check- 
point is to be taken, said providing occurring after 
receipt of said ready message from said each process. 

12. The method of claim 11, wherein said coordinating 
further comprises: 

sending a done message from said each process to said 
coordinating task indicating completion of said corre- 
sponding checkpoint; and 

forwarding, by said coordinating task to said each 
process, a commit message indicating that said corre- 
sponding checkpoint is to be committed, said forward- 
ing occurring after receipt of said done message from 
said each process. 

13. The method of claim 12, further comprising: 
committing, by each process of said number of processes, 
to said corresponding checkpoint; and 
deleting, by each process of said number of processes, any 
previous corresponding checkpoint information, after 
committing to said corresponding checkpoint. 

14. A method of checkpointing parallel programs, said 
method comprising: 

taking a checkpoint by a process of a parallel program, 

said taking a checkpoint comprising: 

writing to a data section of said process at least one of 
a signal state and one or more file offsets; 

subsequently, writing said data section to a checkpoint 
file corresponding to said process; 

writing message data to said checkpoint file, said 
message data including an indication that there are 
no messages, or including one or more in-transit 
messages between said process and one or more 
other processes of said parallel program; and 

writing at least one of executable information, stack 
contents and register contents to said checkpoint file. 

15. The method of claim 14, wherein said taking a 
checkpoint further comprises at least one of stopping mes- 
sage traffic of said process and blocking signals of said 
process, prior to writing to said data section. 

16. The method of claim 15, wherein said parallel pro- 
gram has a plurality of processes, and wherein said taking a 
checkpoint is performed by each process of said plurality of 
processes. 

17. The method of claim 16, further comprising restoring 
said parallel program, said restoring using the checkpoints 
taken by said plurality of processes. 

18. A method of restoring parallel programs, said method 
comprising: 

restarting one or more processes of a parallel program on 
one or more computing units, wherein at least one 
process of said one or more processes is restarted on a 
different computing unit from the computing unit that 
was previously used to take at least one checkpoint for 
said at least one process; and 
copying data stored in one or more checkpoint files 
corresponding to said one or more restarted processes 
into memory of said one or more computing units 
executing said one or more restarted processes, wherein 
said data restores said one or more restarted processes 
to an earlier state. 

19. The method of claim 18, wherein said one or more 
checkpoint files are stored in global storage accessible by 
said one or more computing units. 

20. A method of checkpointing parallel programs, said 
method comprising: 

indicating, by a process of a parallel program, that said 
process is ready to take a checkpoint; 

receiving, by said process, an indication to take said 
checkpoint; 

taking said checkpoint, wherein said taking said check- 
point comprises having said process copy data from 
memory associated with said process to a checkpoint 
file corresponding to said process; and 

indicating, by said process, completion of said taking of 
said checkpoint. 

* * * * * 